Data used in this work is House Sale Prices in Ames, Iowa, USA from 2006-2010, downloaded from Kaggle : https://www.kaggle.com/c/house-prices-advanced-regression-techniques (train.csv) .

library(tidyverse)
df_ori=read.csv('train.csv')
df_ori=tbl_df(df_ori)
df_ori

Data structures :

str(df_ori)
## tibble [1,460 x 81] (S3: tbl_df/tbl/data.frame)
##  $ Id           : int [1:1460] 1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int [1:1460] 60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr [1:1460] "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int [1:1460] 65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int [1:1460] 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr [1:1460] "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr [1:1460] NA NA NA NA ...
##  $ LotShape     : chr [1:1460] "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr [1:1460] "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr [1:1460] "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr [1:1460] "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr [1:1460] "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr [1:1460] "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr [1:1460] "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr [1:1460] "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr [1:1460] "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr [1:1460] "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int [1:1460] 7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int [1:1460] 5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int [1:1460] 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int [1:1460] 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr [1:1460] "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr [1:1460] "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr [1:1460] "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr [1:1460] "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr [1:1460] "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int [1:1460] 196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr [1:1460] "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr [1:1460] "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr [1:1460] "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr [1:1460] "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr [1:1460] "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr [1:1460] "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr [1:1460] "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int [1:1460] 706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr [1:1460] "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int [1:1460] 0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int [1:1460] 150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int [1:1460] 856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr [1:1460] "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr [1:1460] "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr [1:1460] "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr [1:1460] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int [1:1460] 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int [1:1460] 854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int [1:1460] 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int [1:1460] 1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int [1:1460] 0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int [1:1460] 2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int [1:1460] 1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int [1:1460] 3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int [1:1460] 1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr [1:1460] "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int [1:1460] 8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr [1:1460] "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int [1:1460] 0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr [1:1460] NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr [1:1460] "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int [1:1460] 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr [1:1460] "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int [1:1460] 2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int [1:1460] 548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr [1:1460] "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr [1:1460] "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr [1:1460] "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int [1:1460] 0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int [1:1460] 61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int [1:1460] 0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int [1:1460] 0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr [1:1460] NA NA NA NA ...
##  $ Fence        : chr [1:1460] NA NA NA NA ...
##  $ MiscFeature  : chr [1:1460] NA NA NA NA ...
##  $ MiscVal      : int [1:1460] 0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int [1:1460] 2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int [1:1460] 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr [1:1460] "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr [1:1460] "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int [1:1460] 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
df_ori %>% is.na() %>% colSums()
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             0           259             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0          1369             0             0             0 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             0             0 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##             8             8             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##            37            37            38            37             0 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##            38             0             0             0             0 
##     HeatingQC    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF 
##             0             0             1             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             0             0             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             0             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             0             0           690            81            81 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##            81             0             0            81            81 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0          1453          1179          1406 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             0             0 
##     SalePrice 
##             0

Let’s exclude Alley, FireplaceQu, PoolQC, Fence, MiscFeature since they have many missing values. Id column also need to be removed because it won’t give us any useful informations.

df_ori=df_ori %>% select(-c(Id, Alley, FireplaceQu, PoolQC, Fence, MiscFeature))

Question 1 : Getting insights from SalePrice distribution

SalePrice is the target variable of this dataset and since there is no NA values in this column thus it can be directly plotted or analyzed.

library(patchwork)
p1=df_ori %>% 
  ggplot(aes(x=SalePrice))+geom_histogram()+theme_bw()
p2=df_ori %>% 
  ggplot(aes(x=SalePrice))+geom_boxplot()+theme_bw()+theme(axis.text.y = element_blank(),axis.ticks.y=element_blank())
p1+p2+plot_annotation(title='The Histogram and Boxplot of SalePrice Variable')

summary(df_ori$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

From the statistic summary of SalePrice, 50% of SalePrices from 2006-2010 in Ames, Iowa ranged from 129975-214000 $ meanwhile the maximum value is very far from that range, i.e 755000, indicating that the distribution of SalePrices is likely to be positively skewed (proven to be true by the histogram and the boxplot). Moreover, these results are quite intuitive, such as many outliers detected in the right tail side of the boxplot indicating that there are exceptional houses valued so expensively and much higher than Ames’ standard house prices. From the author point of view, this phenomenon is common for most cities/states in the world since each city tends to have its own elite neighborhood, for example Beverly Hills in California. Other words to explain this phenomenon is there are no limitations for how luxurious or expensive a house can be but of course there are standards for a building can still be stated as house.

Question 2 : Find 5 variables that are highly correlated to the SalePrice variable

This question will be answered by plotting the pearson correlation heatmap for all numerical variables. However, please note that not all numerical variables can be included in the calculation since some of them have NA values. To avoid any complex procedures, first let’s exclude all numerical variables that contain NA values.

2.1 Pearson Correlation

library(ggcorrplot)
df_ori %>% 
  select_if(~ !any(is.na(.))) %>% select_if(is.numeric) %>% cor() %>% ggcorrplot(lab=TRUE,type='lower',title='Pearson Correlation Heatmap')

Based on the result, it can be easily seen that there are few variables that highly correlated to the SalePrice such as :

  • OverallQual with 0.79 : Rating about the overall material and finish of the house , scaled from 1 (very poor) - 10 (very excellent)
  • GrLivArea with 0.71 : Above grade (ground) living area square feet, emphasizes any livable spaces above ground that are finished and accessible with heating and ventilation. This definition includes bedrooms, bathrooms, kitchens, dining rooms and living rooms, per guidelines laid out by the American National Standard Institute (ANSI).
  • GarageCars with 0.64 and GarageArea with 0.62 : Both variables measure the area of the garage. The first one is the measurement of garage area in car capacity and the latter is in square feet. That’s why these variables are highly correlated to each other, i.e 0.88, indicating that one of them is possibly redundant.
  • TotalBsmtSF with 0.61 : Total basement area in square feet
  • 1stFlrSF with 0.61 : 1st floor area in square feet. Although there are houses with 0 TotalBsmtSF (no basement) but do have 1st floor, correlation matrix shows that these variables, TotalBsmtSF and 1stFlrSF, are highly correlated with 0.82 score. This value for me is reasonable since houses with larger basement area tend to have larger 1st floor area too.

Other variables that do have moderate correlation score are FullBath with 0.56 (the number of full bathroom within the house), TotRmsAbvGrd with 0.53 (total rooms above grade except bathroom), and YearBuilt with 0.52 (original construction year).

2.2 Spearman Correlation

df_ori %>% 
  select_if(~ !any(is.na(.))) %>% select_if(is.numeric) %>% cor(method = 'spearman') %>% ggcorrplot(lab=TRUE,type='lower',title='Spearman Correlation Heatmap')

While Pearson is sensitive to the existence of the outliers, Spearman is less but also applicable and more preferable when there are one or few ordinal variables in the correlation matrix. To gain more insights from the data, i think it’s not hurt to include both of this method in our analysis. The result is not much different from the previous one, when some variables such as OverallQual, GrLivArea, GarageCars and GarageArea are still considered highly correlated with the SalePrice variable. However some of the differences are the correlation score for YearBuilt (0.65) and Fullbath (0.64) are valued higher than TotalBsmtSF (0.6) and 1stFlrSF (0.58).

2.3 Conclusion

To validate the results, we will plot each variable above.

df_ori %>% ggplot(aes(y=SalePrice,fill=OverallQual))+geom_boxplot()+theme_bw()+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())+facet_grid(~OverallQual)+labs(title = 'SalePrice vs OverallQuall')

  • One can clearly see that SalePrice tend to get higher as the OverallQuall increases
p1=df_ori %>% ggplot(aes(x=GrLivArea,y=SalePrice,color=GrLivArea))+geom_point()+theme_minimal()+geom_smooth(method = 'lm')+labs(title='SalePrice vs GrLivArea')
p2=df_ori %>% ggplot(aes(y=SalePrice,fill=GarageCars))+geom_boxplot()+theme_bw()+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())+facet_grid(~GarageCars)+labs(title = 'SalePrice vs GarageCars')
p3=df_ori %>% ggplot(aes(x=GarageArea,y=SalePrice,color=GarageCars))+geom_point()+theme_minimal()+geom_smooth(method = 'lm')+labs(title='SalePrice vs GarageArea')
p4=df_ori %>% mutate(GarageCarsChara=as.character(GarageCars)) %>% ggplot(aes(x=GarageCarsChara))+geom_bar(fill='steelblue',stat='count')+theme_minimal()+labs(title='Count of GarageCars',x='GarageCars')
(p1+p4)/(p2+p3)

  • From the SalePrice vs GrLivArea plot, we can clearly see the positive correlation between SalePrice and GrLivArea, however there are also anomalies such as two houses with large GrLivArea (right lower) come with unnatural prices.
  • The price distribution tends to get higher as the GarageCars increases but it unnaturally goes down for 4 car capacity-garage. However when we look at the count bar graph, the amount of houses with 4 GarageCars is really small compared to others. Again, because the price distribution tends to get higher as the GarageCars increases, we can expect positive linear correlation between GarageArea and the SalePrice since both of these variabels (GarageArea & GarageCars) are highly correlated.
p1=df_ori %>% ggplot(aes(y=SalePrice,fill=FullBath))+geom_boxplot()+theme_bw()+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())+facet_grid(~FullBath)+labs(title = 'SalePrice vs FullBath')
p2=df_ori %>% mutate(FullBathChara=as.character(FullBath)) %>% ggplot(aes(x=FullBathChara))+geom_bar(stat = 'count',fill='steelblue')+theme_minimal()+labs(title='Count of FullBath number',x='FullBath')
p3=df_ori %>% ggplot(aes(x=YearBuilt,y=SalePrice))+geom_point(aes(color=YearBuilt))+geom_smooth(method = 'lm',color='black')+theme_minimal()+labs(title='SalePrice vs YearBuilt')
p4=df_ori %>% ggplot(aes(x=TotalBsmtSF,y=SalePrice))+geom_point(aes(color=TotalBsmtSF))+geom_smooth(method = 'lm',color='black')+theme_minimal()+labs(title='SalePrice vs TotalBsmtSF')
p5=df_ori %>% ggplot(aes(x=X1stFlrSF,y=SalePrice))+geom_point(aes(color=X1stFlrSF))+geom_smooth(method = 'lm',color='black')+theme_minimal()+labs(title='SalePrice vs 1stFlrSF',x='1stFlrSF')
p6=ggplot()+theme_minimal()+theme(axis.line = element_blank())
(p1+p2+p3)/(p4+p5+p6)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

  • Aside of 0 Full Bathroom, we can see that the increment of Full Bathroom amount from 1-3 will be followed by the increase of SalePrice distribution. Moreover, the count barplot also shows that it’s quite rare for houses in Ames to not have any full bathrooms, which invites us to explore why houses with 0 full Bathroom have almost similar price to 1 full bathroom.
  • It can be interpreted from YearBuilt plot that newer houses tend to have higher SalePrice
  • Since both variables TotalBsmtSF and 1stFlrSF are highly correlated, we can expect quite similar result from both of them, which tell us houses with bigger Basement and 1st floor area tend to have higher SalePrice. There are also anomalies in lower right side that show some houses with large basement/1st floor area come with unnaturally low price.

To sum up all works in this question, we can conclude :

  • 5 Variables that have highest correlation to the SalePrice (by spearman method) are OverallQual, GrLivArea, GarageCars or GarageArea, YearBuilt and FullBath. Moreover, there are still other variables that do have high correlation as well such as TotalBsmtSF and 1stFlrSF.
  • Be careful of false correlation and redundant variables, for example : does the FullBath really affect the SalePrice that much? Or maybe it happens because the correlation between both is influenced by GrLivArea? Since FullBath is actually included in GrLivArea measurement. Meanwhile one of the examples of redundant variable is GarageCars since it’s just scaled version of GarageArea.
  • Despite all the correlation, there are also anomalies found for example in GrLivArea vs SalePrice plot we can see there are 2 houses with large GrLivArea but have unnaturaly low price and there are also houses with large basement or 1st floor are but do have abnormal low price too. The one in GrLivArea vs SalePrice will be analyzed in the next question.

Question 3 : Analyze the relationship between OverallQuall and SalePrice

The OverallQuall is the ordinal variable which determines the rating about the overall material and finish of the house , scaled from 1 (very poor) - 10 (very excellent).

p1=df_ori %>% ggplot(aes(y=SalePrice,fill=OverallQual))+geom_boxplot()+theme_bw()+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())+facet_grid(~OverallQual)+labs(title = 'SalePrice vs OverallQuall',x='Overall Quality')
p2=df_ori %>% ggplot(aes(x=OverallQual,y=SalePrice))+geom_jitter(aes(color=OverallQual),width = 0.25)+scale_x_discrete(limits=c(1:10))+theme_minimal()+labs(title = 'SalePrice vs OverallQuall',x='Overall Quality')
p1/p2

This variable will be analyzed by plotting the relationship between OverallQuall and SalePrice and it gives result that the SalePrice tends to go higher (the median increases) as the Overall Quality increases . This result considered reasonable as the better quality of materials and finish of the house will also offer higher price. From the plot one also can clearly see intuitive result that houses with Excellent+ and Poor- are exceptional and rare , meanwhile houses with Average-Good quality are frequent.

p1=df_ori %>% ggplot(aes(x=OverallQual))+geom_histogram(binwidth = 1,fill='steelblue')+theme_minimal()+labs(title='Distribution of OverallQual',subtitle = 'The distribution of OverallQual nearly fits the normal distribution')
p1

summary(df_ori$OverallQual)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.000   6.099   7.000  10.000

Half of the houses in Ames, Iowa are rated 5 (Average) to 7 (Good).

Question 4 : What makes newer houses get higher prices?

In question 2, we concluded that one of the strong predictors for SalePrice is YearBuilt which is none other than the original construction date. The plot of YearBuilt vs SalePrice also shows that newer houses tend to get higher prices. However, are we sure of this insight? Or maybe there are actually other reasons that make newer houses get higher prices? To evaluate this question, let’s take a look at the previous Spearman correlation heatmap to check other variables that correlate with YearBuilt :

Let’s evaluate those variables one by one :

p1=df_ori %>% ggplot(aes(y=YearBuilt,fill=OverallQual))+geom_boxplot()+theme_bw()+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())+facet_grid(~OverallQual)+labs(title = 'YearBuilt vs OverallQuall',x='Overall Quality')
p2=df_ori %>% ggplot(aes(x=OverallQual,y=YearBuilt))+geom_jitter(aes(color=OverallQual),width = 0.25)+geom_smooth(method='lm',color='black')+scale_x_discrete(limits=c(1:10))+theme_minimal()+labs(title = 'YearBuilt vs OverallQuall',x='Overall Quality')
## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?
p3=df_ori %>% ggplot(aes(x=YearBuilt,y=SalePrice))+geom_point(aes(color=OverallQual))+geom_smooth(method='lm',color='black')+theme_minimal()+labs(title='SalePrice vs YearBuilt')
p4=ggplot()+theme_minimal()
(p1+p3)/(p2+p4)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

p1=df_ori %>% ggplot(aes(y=YearBuilt))+geom_boxplot()+theme_bw()+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())+facet_grid(~GarageCars)+labs(title = 'YearBuilt vs GarageCars',x='GarageCars')
p2=df_ori %>% ggplot(aes(x=GarageCars,y=YearBuilt))+geom_jitter(aes(color=cut_number(YearBuilt,4)),width = 0.25)+scale_x_discrete(limits=c(0:4))+theme_minimal()+labs(title = 'YearBuilt vs GarageCars',x='YearBuilt Binning',color='YearBuilt Binning (Equal Amount)')
## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?
p3=df_ori %>% mutate(GarageCarsChara=as.character(GarageCars)) %>%
  ggplot(aes(x=YearBuilt,y=SalePrice))+geom_point(aes(color=GarageCarsChara))+geom_smooth(method='lm')+theme_minimal()+scale_color_manual(values = c("0" = "purple","1"="orange","2"="steelblue","3"="green","4"="brown")) +labs(title='SalePrice vs YearBuilt',color='GarageCars')
p4=df_ori %>% ggplot(aes(x=GarageArea,y=YearBuilt))+geom_point(aes(color=cut_number(YearBuilt,4)))+theme_minimal()+labs(color='YearBuilt Binning (Equal Amount)',title='YearBuilt vs GarageArea')
(p1+p3)/(p2+p4)
## `geom_smooth()` using formula 'y ~ x'

p1=df_ori %>% mutate(FullBathChara=as.character(FullBath)) %>%
  ggplot(aes(x=YearBuilt,y=SalePrice))+geom_point(aes(color=FullBathChara))+geom_smooth(method='lm')+theme_minimal()+scale_color_manual(values = c("0" = "purple","1"="orange","2"="steelblue","3"="green")) +labs(title='SalePrice vs YearBuilt',color='FullBath')
p2=df_ori %>%
  ggplot(aes(y=YearBuilt))+geom_boxplot()+facet_grid(~FullBath)+theme_bw()+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())+labs(title='YearBuilt vs FullBath')
p3=df_ori %>% ggplot(aes(x=FullBath,y=YearBuilt))+geom_jitter(aes(color=cut_number(YearBuilt,3)))+theme_minimal()+labs(title='YearBuilt vs FullBath',color='YearBuilt Binning (Equal Amount)')
p4=df_ori %>% ggplot()+theme_minimal()
(p2+p1)/(p3+p4)
## `geom_smooth()` using formula 'y ~ x'

Question 5 : Analyze two ‘Haunted Houses’ in GrLivArea vs SalePrice scatter plot!

library(gghighlight)
p1=df_ori %>% ggplot(aes(x=GrLivArea,y=SalePrice))+geom_point(color='red',shape=20,size=3)+theme_minimal()+labs(title='SalePrice vs GrLivArea')+gghighlight(GrLivArea > 4500,unhighlighted_params = aes(fill = NULL, alpha = 0.5,shape=18))+labs(subtitle = 'There are 2 houses that do have large GrLivArea but priced unnaturally low')+geom_text(aes(y=SalePrice*1.2,label='haunted?'))
p1

One can easily see that houses in dataset commonly have GrLivArea ranged from 1000-2500 sqft. Houses with GrLivArea > 4000 are exceptional and very rare to be found. However, there are 2 houses that will become our observation here (I highlight these houses with red color) because they have unnatural low price for their large area. They are easily recognized as the largest above grade living area in this dataset (>4500 sqft), but how come their price are below 200000 $ given that GrLivArea is strongly correlated with the SalePrice?

df_ori %>% filter(GrLivArea>4500)

From SaleType and SaleCondition, we accept informations such as the houses were actually just constructed and sold at the time (2007 and 2008) but hadn’t been completed yet when last assessed. Since they are new and just constructed, i do not think these houses give such frightening vibes hence people assumed they are haunted. Given they also have huge GrLivArea and 10/10 OverallQual , other important numerical variables such as GarageCar, FullBath, TotalBsmtSF, etc should also have good value. Other than numerical variables, I think categorical variables must be analyzed too, aren’t they? To do that, I will create a function to plot all categorical variables vs SalePrice distribution simultaneously. I will also highlight the category related to the observed houses to make the plots are easier to read.

library(Rmisc)
library(grid)
Plotme=function(vector_column){
  p=list()
  for (i in 1:length(vector_column)){
    highlight_category=df_ori[df_ori$GrLivArea>4500,][vector_column[i]]
    p[[i]]=ggplot(df_ori)+geom_boxplot(aes_string(x=vector_column[i],y='SalePrice'))+theme_minimal()+theme(axis.text.x = element_text(angle=90))+labs(title=paste('SalePrice vs',vector_column[i]),subtitle = paste('Highlight :',highlight_category[1,1],'and',highlight_category[2,1]))
  }
  multiplot(plotlist=p,cols=4)
}
df_ctg=df_ori %>% 
  select_if(~!any(is.numeric(.))) #Select categorical data
ctg=colnames(df_ctg) #categorical variables
Plotme(ctg)

From SalePrice vs SaleCondition, one can see that Partial are roughly higher than others hence we can’t assume that the unnatural low prices come from it. SalePrice vs SaleType plot also supports this statement by showing the price of new houses tend to be higher. Unfortunately, I can’t extract much useful informations from other plots that may answer our problem since almost all category variables do not show any clear influences on SalePrice and the observed houses also do not have any bad quality reviews. However, I suspect 1 variable that may can help us answer this question, namely Neighborhood. The Neighborhood of our observed houses is Edwards.

library(ggridges) #library to plot the ridgline plot
df_ori %>% 
  ggplot(aes(x=SalePrice/1000,y=Neighborhood,fill=Neighborhood))+geom_density_ridges()+theme_minimal()+labs(x='Sale Prices in 1000$',y='Neighborhood')+theme_ridges(font_size = 11,center_axis_labels = TRUE)+theme(legend.position = 'none')+labs(title='Sale Price for Each Neighborhood',subtitle='Edwards seems to have low price distribution compared to others')

Let’s sort the Neighborhood from the lowest SalePrice median.

x=df_ori %>% #Neighborhood with low price
  group_by(Neighborhood) %>% 
  dplyr::summarise(count=n(),medianSalePrice=median(SalePrice),meanQuall=mean(OverallQual)) %>% arrange(medianSalePrice)
che=x[1:10,'Neighborhood'] #save 10 neighborhoods with the lowest SalePrice median
che=che[[1]]
x

After sorting the median of Price Distribution, we can confirm that Edwards is one of the Neighborhoods with the lowest median of sale price distribution (5th), i.e 121750, roughly half the price of the 3 most elite Neighborhood such as StoneBr (278000), NoRidge (301500), and NridgHt (315000).To prove the influences of Neighborhood, let’s compare the SalePrice vs GrLivArea trendline between 10 lowest neighborhood and others neighborhood.

df_ori %>% ggplot(aes(x=GrLivArea,y=SalePrice,color=Neighborhood %in%che))+geom_point()+geom_smooth(method='lm')+theme_minimal()+scale_color_discrete('10 Neighborhood with the lowest price')+labs(title='GrLivArea vs SalePrice',subtitle='10 Neighborhood with the lowest price have smaller trendline slope')
## `geom_smooth()` using formula 'y ~ x'

From above graph, I think we have found one answer to our question which is Neighborhood plays a big role in the measurement of houses sale price or in other words, these houses have unnaturally low prices because they are located in 10 Neighborhood with the lowest price.

But, is it all?

I am not sure if Neighborhood is the one and only answer, but I am not sure either which (category variables other than neighborhood) that also ‘play’ here. However, I do have other suspicions and this might be a controversial take from me. Let’s visualize the SalePrice vs YrSold using jitter plot.

df_ori %>% ggplot(aes(x=YrSold,y=SalePrice))+geom_jitter(color='gray')+geom_smooth(color='black',method='lm')+theme_minimal()+labs(title='Great Recession Era',subtitle='Great Recession starts from December 2007 - June 2009')
## `geom_smooth()` using formula 'y ~ x'

According to https://www.investopedia.com/terms/g/great-recession.asp , The Great Recession was the sharp decline in economic activity during the late 2000s. It is considered the most significant downturn since the Great Depression. The term Great Recession applies to both the U.S. recession, officially lasting from December 2007 to June 2009, and the ensuing global recession in 2009. The economic slump began when the U.S. housing market went from boom to bust, and large amounts of mortgage-backed securities (MBS’s) and derivatives lost significant value. In Ames however, we get interesting result that that the prices were stagnant during 2006-2010. These years was surely a nightmare for all real estate inventors because the prices didn’t raise year after year. Coincidently, our 2 observed houses were sold exactly at the beginning of the recession, i.e October 2007 and January 2008. So, I think right now we understand the situation back at the time.

Other thing that i want to emphasize is the Supply vs Demand concept. Let’s take a look again at the jitter plot of YrSold vs GrLivArea.

p1=df_ori %>% ggplot(aes(x=YrSold,y=GrLivArea))+geom_jitter(color='gray')+geom_smooth(method='lm',color='black')+theme_minimal()+labs(title='YrSold vs GrLivArea')
p2=df_ori %>% ggplot(aes(y=GrLivArea))+geom_boxplot()+theme_minimal()+labs(title='Distribution of GrLivArea')+theme(axis.text.x = element_blank())
p1+p2
## `geom_smooth()` using formula 'y ~ x'

q3=quantile(df_ori$GrLivArea,0.75)
q1=quantile(df_ori$GrLivArea,0.25)
out_lim=q3+1.5*(q3-q1)
n=df_ori %>% filter(GrLivArea<=out_lim) %>% select(YrSold) %>% summarise_all(funs(n()))
print(paste('The minimum GrLivArea is :',min(df_ori$GrLivArea)))
## [1] "The minimum GrLivArea is : 334"
print(paste('The mean of GrLivArea is :',mean(df_ori$GrLivArea)))
## [1] "The mean of GrLivArea is : 1515.46369863014"
print(paste('The median of GrLivArea is :',median(df_ori$GrLivArea)))
## [1] "The median of GrLivArea is : 1464"
print(paste('Percentage of houses exclude outlier is ',as.integer(n[[1]])/1460))
## [1] "Percentage of houses exclude outlier is  0.978767123287671"
print(paste('Q3 + 1.5IQR =',out_lim))
## [1] "Q3 + 1.5IQR = 2747.625"

By simple calculation, we know that 97.87% of houses sold within 2006-2009 have GrLivArea < 2747.625 sqft. In other words, from our sample, we know that in Ames, 97.87% demands during that period are houses with GrLivArea ranged from 334-2748 sqft. So let’s pretend back at the time that you want to sell a house : Do you want to take such a big risk, sell a quite large houses with GrLivArea > 4500 sqft by quite expensive price, located in Edwards, at the beginning of the Great Recession? I think the answer is simply no, isn’t it?

To end this long explanation, I want to engineer a new feature, called Price_Area, which is none other than SalePrice/GrLivArea. Based on my hypotheses, SalePrice/GrLivArea will decrease after reaching a certain point of GrLivArea (house will be less valuable if it is too large), so the slope of Price_Area linear trendline is negative as GrLivArea increases. To get stronger evidence, i will exclude these observed houses from the plot so the trendline won’t be affected by these ‘haunted’ houses.

df_cad=df_ori %>% mutate(Price_Area=SalePrice/GrLivArea)
df_cad %>% filter(GrLivArea < 4500) %>% ggplot(aes(x=GrLivArea,y=Price_Area))+geom_point()+geom_smooth(method='lm')+facet_grid(cols=vars(YrSold))+theme_bw()
## `geom_smooth()` using formula 'y ~ x'

As I expected, the slope will be negative and reach its peak during 2007-2008 (the climax of Great Recession). At 2006 and 2009 the slope is very close to 0 and return to be negative at 2010 (please note that the data was recorded until July 2010).

This is just my opinion, so please correct me if I am wrong. After all, I think these houses absolutely are not haunted, maybe it was just such a right decision to sell them cheaply back at the time.

Question 6 : Get 1 interesting insight from the dataset

I got this insight when I was doing the analysis of the previous number. Let’s plot MoSold vs the amount of sales during 2006-2010.

df_ori %>% ggplot(aes(x=MoSold))+geom_bar(stat = 'count')+facet_grid(cols=vars(YrSold))+coord_flip()+theme_bw()+scale_x_discrete(limits=c(1:12))+labs(title='Count of Sales for Each Month during 2006-2010',subtitle = 'The amount of sales always reach its peak at June - July')
## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?

From above graph, one can clearly see that more houses are sold in June-July period. For your information, June usually is the beginning of the summer season and this is where season plays a big role. Summer is always be a busy season for house hunting. It is the season of weddings, holiday and the time when families prepare to send their children to new schools. These are three popular reasons for moving home or buying a new house. There are already many articles about this on the internet. If you are interested, you can read one of them : https://slate.com/culture/2008/09/an-economics-mystery-why-houses-cost-more-in-summer-has-finally-been-solved.html .

How about the prices? Well remember back at the time, The Great Recession happened.

mosold_chara=as.character(c(1:12))
df_ori %>% ggplot(aes(x=as.character(MoSold),y=SalePrice))+geom_boxplot()+scale_x_discrete(limits=mosold_chara)+theme_minimal()+labs(title='Price vs MoSold',subtitle = 'Price tends to be stagnant throughout the month',x='Month',y='SalePrice')